Unsupervised Part-of-Speech Tagging Employing Efficient Graph Clustering
Abstract
An unsupervised part-of-speech (POS) tagging system that relies on graph clustering methods is described. Unlike current state-of-the-art approaches, the method itself determines the kind and number of tags. We compute and merge two partitionings of word graphs: one based on context similarity of high-frequency words, the other on log-likelihood statistics for lower-frequency words. The resulting word clusters serve as a lexicon for training a Viterbi POS tagger, which is then refined by a morphological component. The approach is evaluated on three languages by measuring agreement with existing taggers.
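The abstract does not name the clustering algorithm, so the following is only a minimal sketch of the general idea that a graph clustering method can produce the number of word classes on its own. It uses a simple label-propagation scheme as a stand-in, not the paper's actual method. The function name `cluster_word_graph`, the edge weights, and the toy vocabulary are all hypothetical.

```python
from collections import defaultdict


def cluster_word_graph(edges, max_iters=20):
    """Label-propagation clustering over a weighted, undirected word graph.

    edges: iterable of (word_a, word_b, weight) tuples, where the weight
    stands in for some similarity score between the two words.
    Returns a dict mapping each word to a cluster label.
    """
    neighbors = defaultdict(dict)
    for a, b, w in edges:
        neighbors[a][b] = neighbors[a].get(b, 0.0) + w
        neighbors[b][a] = neighbors[b].get(a, 0.0) + w

    # Every word starts in its own cluster, so the number of clusters
    # is not fixed in advance.
    labels = {word: word for word in neighbors}

    for _ in range(max_iters):
        changed = False
        # A fixed visiting order keeps this sketch deterministic.
        for word in sorted(neighbors):
            scores = defaultdict(float)
            for other, weight in neighbors[word].items():
                scores[labels[other]] += weight
            # The strongest neighbouring label wins; ties go to the smaller label.
            best = min(scores, key=lambda lab: (-scores[lab], lab))
            if best != labels[word]:
                labels[word] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy graph: two groups of mutually similar words.
toy_edges = [
    ("cat", "dog", 3.0), ("dog", "horse", 2.0), ("cat", "horse", 2.0),
    ("run", "walk", 3.0), ("walk", "jump", 2.0), ("run", "jump", 2.0),
]
clusters = cluster_word_graph(toy_edges)
```

On this toy graph the two disconnected groups end up under two different labels without the cluster count being passed in. In a setting like the one the abstract describes, the edge weights would come from context similarity or log-likelihood statistics rather than hand-picked numbers.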
Similar resources
SVD and Clustering for Unsupervised POS Tagging
We revisit the algorithm of Schütze (1995) for unsupervised part-of-speech tagging. The algorithm uses reduced-rank singular value decomposition followed by clustering to extract latent features from context distributions. As implemented here, it achieves state-of-the-art tagging accuracy at considerably less cost than more recent methods. It can also produce a range of finer-grained taggings, ...
Using Morphological and Distributional Cues for Inductive Part-of-Speech Tagging
In this paper we evaluate the role of morphological and distributional cues in PoS induction, using an incremental and unsupervised learning algorithm with clustering on a vector space.
Unsupervised Part-of-Speech Induction
Part-of-Speech (POS) tagging is an old and fundamental task in natural language processing. While supervised POS taggers have shown promising accuracy, it is not always feasible to use supervised methods due to lack of labeled data. In this project, we attempt to induce POS tags in an unsupervised manner by iteratively looking for a recurring pattern of words through a hierarchical agglomerative clusterin...
Sphere Embedding: An Application to Part-of-Speech Induction
Motivated by an application to unsupervised part-of-speech tagging, we present an algorithm for the Euclidean embedding of large sets of categorical data based on co-occurrence statistics. We use the CODE model of Globerson et al. but constrain the embedding to lie on a high-dimensional unit sphere. This constraint allows for efficient optimization, even in the case of large datasets and high em...
Knowledge-free Verb Detection through Sentence Sequence Alignment
We present an algorithm for verb detection of a language in question in a completely unsupervised manner. First, a shallow parser is applied to identify – amongst others – noun and prepositional phrases. Afterwards, a tag alignment algorithm will reveal fixed points within the structures which turn out to be verbs. Results of corresponding experiments are given for English and German corpora em...
Publication date: 2006